TECHNICAL FIELD
The invention resides in the technical fields of [...] genetics [...].

BACKGROUND 

The traditional approach to genome sequence analysis [...]
Applied Biosystems DNA [...] numbers of organisms. For this
reason, relatively few individuals within a species have been
sequenced to [...] polymorphic variation. Furthermore, [...]
large-scale sequencing.

[...] efficient means of analyzing [...] sequences [...]
reference [...] pattern [...] target nucleic acid reveals the
position, and optionally the nature, of [...] target and
reference sequence. For [...] describes arrays comprising four
probe sets. Comparison of the intensities of corresponding
probes from the four sets to a target sequence reveals the
identity of a corresponding nucleotide in the target sequences
aligned with an interrogation position of the probes. The
corresponding nucleotide is the complement of the nucleotide
occupying the interrogation position of the probe showing the
highest intensity.

The existence of variation between a target and
reference sequence can also be identified by differences in
normalized hybridization intensities of probes flanking the
variation when the probes are respectively hybridized to
target and reference sequences. Relative loss of
hybridization intensity is manifested as a "footprint" of
probes flanking the point of variation between target and
reference sequence (see EP 717,113, incorporated by reference
in its entirety for all purposes). Additionally,
hybridization intensities for multiple targets from different
sources can be classified into groups or clusters suggested by
the data, not defined a priori, such that isolates in a given
cluster tend to be similar and isolates in different clusters
tend to be dissimilar (see WO 97/29212, incorporated by
reference in its entirety for all purposes).

Array-based resequencing has been used, for example,
in the identification of large numbers of polymorphisms
in mitochondrial DNA and ESTs, the identification of
drug-induced mutations in HIV, and analysis of mutations in p53
correlated with human cancer.

DEFINITIONS 
A nucleic acid is a deoxyribonucleotide or
ribonucleotide polymer in either single- or double-stranded
form, including known analogs of natural nucleotides unless
otherwise indicated.

An oligonucleotide is a single-stranded nucleic acid
ranging in length from 2 to about 500 bases, and is typically
about 8-40, and more typically, 10-25 bases.

A probe is an oligonucleotide capable of binding to
a target nucleic acid of complementary sequence through one or
more types of chemical bonds, usually through complementary
base pairing, usually through hydrogen bond formation. An
oligonucleotide probe may include natural (i.e., A, G, C, or
T) or modified bases (e.g., 7-deazaguanosine, inosine). In
addition, the bases in an oligonucleotide probe may be joined by
a linkage other than a phosphodiester bond, so long as it does
not interfere with hybridization. Thus, oligonucleotide
probes may be peptide nucleic acids in which the constituent
bases are joined by peptide bonds rather than phosphodiester
linkages. See Nielsen et al., Science 254, 1497-1500 (1991).

Specific hybridization refers to the binding [...]
nucleotide sequence under stringent conditions when that
sequence is present in a complex mixture (e.g., total
cellular) DNA or RNA. Stringent conditions are conditions
under which a probe will hybridize to its target subsequence,
but to no other sequences. Stringent conditions are
sequence-dependent and are different in different circumstances.
Longer sequences hybridize specifically at higher
temperatures. Generally, stringent conditions are selected to
[...] thermal melting point (Tm) for the specific sequence at
a defined ionic strength and pH. The Tm is the temperature
(under defined ionic strength, pH, and nucleic acid
concentration) at which 50% of the probes
complementary to the target sequence hybridize to the target
sequence at equilibrium. (As the target sequences are
generally present in excess, at Tm, 50% of the probes are
occupied at equilibrium.) Typically, stringent conditions
include a salt concentration of at least about 0.01 to 1.0 M
Na ion concentration (or other salts) at pH 7.0 to 8.3, and the
temperature is at least about 30°C for short probes (e.g.,
[...]). [...] conditions can also [...]. For example,
conditions of [...] 50 mM Na phosphate, 5 mM EDTA, pH 7.[...]
and a temperature of 25-30°C [...].

A perfectly matched probe has a sequence perfectly
complementary to a particular target sequence. [...]
Complementary base pairing means sequence-specific base
pairing which includes e.g., Watson-Crick base pairing or
other forms of base pairing such as Hoogsteen base pairing.
Probes typically have a segment of complementarity of 6-20
nucleotides, and preferably, 10-25 nucleotides. Leading or
trailing sequences flanking the segment of complementarity can
also be present. The term "mismatch probe" refers to probes
whose sequence is deliberately selected not to be perfectly
complementary to a particular target sequence. Although the
mismatch(es) may be located anywhere in the mismatch probe,
terminal mismatches are less desirable as a terminal mismatch
is less likely to prevent hybridization of the target
sequence. Thus, probes are often designed to have the
mismatch located at or near the center of the probe such that
the mismatch is most likely to destabilize the duplex with the
target sequence under the test hybridization conditions.

Polymorphism refers to the occurrence of two or more 
genetically determined alternative sequences or alleles in a 
population. A polymorphic marker or site is the locus at 
which divergence occurs. Preferred markers have at least two 
alleles, each occurring at a frequency of greater than 1%, and
more preferably greater than 10% or 20% of a selected
population. A polymorphic locus may be as small as one base
pair.

An array including a pooled probe means that a cell
in the array is occupied by a pooled mixture of probes. For
example, a cell might be occupied by probes ACCCTCCA and
ACCCCCCA, in which case the position at which the two probes
differ (the fifth base, T or C) is described
as a pooled position. Although the identity of each probe in
the mixture is known, the individual probes in the pool are
not separately addressable. Thus, the hybridization signal
from a cell is the aggregate of that of the different probes
occupying the cell.
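
The pooled-position idea can be illustrated with a short sketch. It takes the probes sharing one cell and reports the positions at which they differ. The function name `pooled_positions` and the zero-based indexing are assumptions made for illustration only.

```python
def pooled_positions(probes):
    """Return {index: sorted bases} for each position where pooled probes differ.

    Indices are zero-based; the probes in one cell are assumed to be of
    equal length.
    """
    if len({len(p) for p in probes}) != 1:
        raise ValueError("pooled probes are assumed to have equal length")
    pooled = {}
    for index, column in enumerate(zip(*probes)):
        bases = set(column)
        if len(bases) > 1:  # more than one base present: a pooled position
            pooled[index] = sorted(bases)
    return pooled


print(pooled_positions(["ACCCTCCA", "ACCCCCCA"]))  # {4: ['C', 'T']}
```

Index 4 is the fifth base of the two example probes.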

The term species variant refers to a gene sequence
that is evolutionarily and functionally related between
species. For example, in the human genome, the human CD4 gene
is the cognate gene to the mouse CD4 gene, since the sequences
and structures of these two genes indicate that they are
highly homologous and both genes encode a protein which
functions in signaling T cell activation through MHC class
II-restricted antigen recognition.

Percentage sequence identity is determined between
optimally aligned sequences from computerized implementations
of algorithms such as GAP, BESTFIT, FASTA, and TFASTA in the
Wisconsin Genetics Software Package Release 7.0, Genetics
Computer Group, 575 Science Dr., Madison, WI.
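
As a rough illustration of percentage sequence identity, the sketch below scores two sequences that have already been aligned (for example by one of the programs named above). It does not perform the alignment itself. The function name and the convention of counting identical columns over all aligned columns, with gap columns counted as mismatches, are assumptions; other conventions exist.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of aligned columns in which both sequences carry the same base.

    Assumes the inputs are already aligned and therefore of equal length.
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGTACGT", "ACGTTCGT"))  # 87.5
```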



SUMMARY OF THE CLAIMED INVENTION 

The invention provides iterative methods of
analyzing a target sequence, which [...]. [...] is hybridized
to the array of [...], and the relative hybridization
intensities of the probes to the target nucleic acid are
determined. The relative hybridization intensities are used
to estimate a sequence of the target nucleic acid. A further
array of probes is then provided comprising a probe set
comprising [...] of the target [...]. [...] hybridized to the
array of [...], and the relative hybridization of the probes
to the target nucleic acid is determined. The [...]
reestimated sequence of the target nucleic acid [...] sequence
of the target nucleic acid.

[...] useful [...] analyzing a target nucleic acid [...]
sequence from a primate. Typically, the target nucleic acid
shows 50-99% sequence identity with the reference sequence.
The methods are also particularly useful in situations where a
target sequence differs from a reference sequence by more than
one mutation within a probe length.

The methods can readily accommodate a reference
sequence of at least 1 or 10 kb [...] a complete or
substantially complete [...]. A probe set for use in the
methods [...] overlapping probes that are perfectly [...]
reference sequence [...] that are perfectly complementary to
the estimated sequence.

In some methods, the array of probes comprises four
probe sets. A first probe set comprises a plurality of
probes, each probe comprising a segment of [...] nucleotides
exactly complementary to a subsequence of the reference
sequence, the segment including at least one interrogation
position complementary to a corresponding nucleotide in the
reference sequence. Second, third and fourth probe sets each
comprise a corresponding probe for each probe in the first
probe set. The probes in the second, third and fourth probe
sets are identical to [...] at least [...] nucleotides thereof
that includes the at least one interrogation position, except
that the at least one interrogation position is occupied by a
different nucleotide in each of the four corresponding probes
from the four probe sets. In such methods, the target
sequence can be estimated by comparing relative specific
binding of four corresponding probes from the first, second,
third and fourth probe sets. A nucleotide in the target
nucleic acid is then assigned as the complement of the
interrogation position of the probe having the greatest
specific binding. Other nucleotides in the target sequence
are assigned by similar comparisons.

The invention further provides methods of analyzing a
target nucleic acid comprising the following steps. An array
of probes is designed to be complementary to an estimated
sequence of the target nucleic acid. The array of probes is
hybridized to the target nucleic acid. The target sequence is
reestimated from the hybridization pattern of the array to the
target nucleic acid. The steps are then repeated at least
once.

DETAILED DESCRIPTION 

1. General



The invention provides improved methods for 
analyzing variants of a reference sequence using arrays of 
probes. The methods are particularly useful for target 
sequences showing substantial variation from a reference 
sequence, as may be the case where target sequence and 
reference sequence are from different species. The methods 
involve designing a primary array of probes based on a known
reference sequence. Effectively, the reference sequence
serves as a first estimate of sequence of the target nucleic
acid. The primary array of probes is hybridized to a target
nucleic acid, and the sequence of the target is estimated as
well as possible from its hybridization pattern to the primary
array. A secondary array of probes is then designed based on
the estimated sequence of the target nucleic acid. The target
nucleic acid is then hybridized with the secondary array of
probes, and the sequence is reestimated from the resulting
hybridization pattern. Further cycles of array design and
estimation of target sequence can be performed in an iterative
fashion, if desired, until the estimated sequence is constant
between successive cycles.

2. Reference Sequences

Reference sequences for polymorphic site 
identification are often obtained from computer databases such 
as Genbank, the Stanford Genome Center, The Institute for 
Genome Research and the Whitehead Institute. The latter
databases are available at http://www-genome.wi.mit.edu;
http://shgc.stanford.edu and http://www.tigr.org. Reference
sequences are typically from well-characterized organisms,
such as human, mouse, C. elegans, Arabidopsis, Drosophila,
yeast, E. coli or Bacillus subtilis. A reference sequence can
vary in length from 5 bases to at least 1,000,000 bases.
Reference sequences are often of the order of 100-10,000
bases. The reference sequence can be from expressed or 
nonexpressed regions of the genome. In some methods, in which 
RNA samples are used, highly expressed reference sequences are 
sometimes preferred to avoid the need for RNA amplification.
The function of a reference sequence may or may not be known.
Reference sequences can also be from episomes such as 
mitochondrial DNA. Of course, multiple reference sequences 
can be analyzed independently. 

3. Target Nucleic Acid Sample Preparation

Targets can represent allelic, species, induced or 
other variants of reference sequences. Considerable diversity 
is possible between reference and target sequence. Target 
sequences usually show between 50-99%, 80-98%, or 90-95% sequence
identity with the reference sequence. For example, a human
reference sequence can be used as the starting point for
analysis of other primates, such as gorillas and orangutans,
other mammals, reptiles, birds, plants, fungi or bacteria.

The nucleic acid samples hybridized to arrays can be 
genomic, RNA or cDNA. Nucleic acid samples are usually 
subject to amplification before application to an array. An
individual genomic DNA segment from the same genomic location 
as a designated reference sequence can be amplified by using 
primers flanking the reference sequence. Multiple genomic 
segments corresponding to multiple reference sequences can be 
prepared by multiplex amplification including primer pairs 
flanking each reference sequence in the amplification mix. 
Alternatively, the entire genome can be amplified using random 
primers (typically hexamers) (see Barrett et al., Nucleic
Acids Research 23, 3488-3492 (1995)) or by fragmentation and
reassembly (see, e.g., Stemmer et al., Gene 164, 49-53
(1995)). Nucleic acids can also be amplified by cloning into
vectors and propagating the vectors in a suitable organism.
YACs, BACs and HACs are useful for cloning large segments of
genomic DNA.

Genomic DNA can be obtained from virtually any
tissue source (other than pure red blood cells). For example,
convenient tissue samples include whole blood, semen, saliva,
tears, urine, fecal material, sweat, buccal, skin and hair.

RNA samples are also often subject to amplification. 
In this case amplification is typically preceded by reverse 
transcription. Amplification of all expressed mRNA can be 
performed as described by commonly owned WO 96/14839 and WO
97/01603. In some methods, in which arrays are designed to
tile highly expressed sequences, amplification of RNA is
unnecessary. The choice of tissue from which the sample is
obtained affects the relative and absolute levels of different
RNA transcripts in the sample. For example, cytochromes P450
are expressed at high levels in the liver. 

4. Methods of Amplification

The PCR method of amplification is described in PCR
Technology: Principles and Applications for DNA Amplification
(ed. H.A. Erlich, Freeman Press, NY, NY, 1992); PCR Protocols:
A Guide to Methods and Applications (eds. Innis, et al.,
Academic Press, San Diego, CA, 1990); Mattila et al., Nucleic
Acids Res. 19, 4967 (1991); Eckert et al., PCR Methods and
Applications 1, 17 (1991); PCR (eds. McPherson et al., IRL
Press, Oxford); and U.S. Patent 4,683,202 (each of which is
incorporated by reference for all purposes). Nucleic acids in
a target sample are usually labelled in the course of
amplification by inclusion of one or more labelled nucleotides
in the amplification mix. Labels can also be attached to
amplification products after amplification, e.g., by
end-labelling. The amplification product can be RNA or DNA
depending on the enzyme and substrates used in the
amplification reaction.

Other suitable amplification methods include the
ligase chain reaction (LCR) (see Wu and Wallace, Genomics 4,
560 (1989), Landegren et al., Science 241, 1077 (1988)),
transcription amplification (Kwoh et al., Proc. Natl. Acad.
Sci. USA 86, 1173 (1989)), and self-sustained sequence
replication (Guatelli et al., Proc. Nat. Acad. Sci. USA, 87,
1874 (1990)) and nucleic acid based sequence amplification
(NASBA). The latter two amplification methods involve
isothermal reactions based on isothermal transcription, which
produce both single stranded RNA (ssRNA) and double stranded
DNA (dsDNA) as the amplification products in a ratio of about
30 or 100 to 1, respectively.

5. Probe Arrays

An array of probes contains at least a first set of
probes that are complementary to a reference sequence (or
regions of interest therein). Typically, the probes tile the
reference sequence. Tiling means that the probe set contains
overlapping probes which are complementary to and span a
region of interest in the reference sequence. For example, a
probe set might contain a ladder of probes, each of which
differs from its predecessor in the omission of a 5' base and
the acquisition of an additional 3' base. The probes in a
probe set may or may not be the same length. The number of
probes can vary widely from about 5, 10, 20, 50, 100, 1000, to
10,000 or 100,000. Typically, the arrays do not contain every
possible probe sequence of a given length.
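
A minimal sketch of tiling, assuming fixed-length probes and a step of one base, is given below. The names `reverse_complement` and `tile_reference` are hypothetical, and each probe is written 5' to 3' as the reverse complement of the reference window it covers.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_reference(reference, probe_length, step=1):
    """Return (start, probe) pairs for overlapping probes spanning the reference.

    Each probe is perfectly complementary to reference[start:start + probe_length].
    """
    probes = []
    for start in range(0, len(reference) - probe_length + 1, step):
        window = reference[start:start + probe_length]
        probes.append((start, reverse_complement(window)))
    return probes


for start, probe in tile_reference("ACGTAC", 4):
    print(start, probe)
```

Read in order of decreasing start position, each probe in this sketch omits the 5' base of its predecessor and gains one 3' base, which is the ladder arrangement described above.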

Often tiling arrays have four probe sets, as 
described in WO 95/11995. The first probe set comprises a 
plurality of probes exhibiting perfect complementarity with a
reference sequence, as described above. Each probe in the
first probe set has an interrogation position that corresponds
to a nucleotide in the reference sequence. That is, the
interrogation position is aligned with the corresponding
nucleotide in the reference sequence, when the probe and
reference sequence are aligned to maximize complementarity
between the two. For each probe in the first set, there are
three corresponding probes from three additional probe sets.
Thus, there are four probes corresponding to each nucleotide
in the reference sequence. The probes from the three
additional probe sets are identical to the corresponding probe
from the first probe set except at the interrogation position,
which occurs in the same position in each of the four
corresponding probes from the four probe sets, and is occupied
by a different nucleotide in the four probe sets.
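
The four-probe-set layout can be sketched as follows. This is an illustration under simplifying assumptions rather than the design procedure of WO 95/11995: probes have a fixed length, the interrogation position sits at a fixed offset, and each probe is written as the base-by-base complement of its reference window (that is, 3' to 5', so that probe and reference positions line up). The name `four_probe_sets` is hypothetical.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def four_probe_sets(reference, probe_length, interrogation_offset):
    """Return one column of four probes per interrogated reference position.

    Each column is (reference_index, {probe_base: probe}), where the four
    probes are identical except at the interrogation position.
    """
    columns = []
    for start in range(len(reference) - probe_length + 1):
        window = reference[start:start + probe_length]
        # Perfect-match probe: complement of the window, aligned base for base.
        perfect = "".join(COMPLEMENT[b] for b in window)
        column = {}
        for base in BASES:
            column[base] = (perfect[:interrogation_offset] + base
                            + perfect[interrogation_offset + 1:])
        columns.append((start + interrogation_offset, column))
    return columns


for position, column in four_probe_sets("ACGTA", 3, 1):
    print(position, column)
```

In each column, exactly one of the four probes is the perfect complement of the reference window; it corresponds to the member of the first probe set.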

A substrate bearing the four probe sets is 
hybridized to a labelled target sequence, which shows
substantial sequence similarity with the reference sequence,
but which may differ due to, e.g., species variations. The
amount of label bound to probes is measured. Analysis of the
pattern of label reveals the nature and position of
differences between the target and reference sequence. For
example, comparison of the intensities of four corresponding
probes reveals the identity of a corresponding nucleotide in
the target sequences aligned with the interrogation position
of the probes. The corresponding nucleotide is the complement
of the nucleotide occupying the interrogation position of the
probe showing the highest intensity. The comparison can be
performed between successive columns of four corresponding 
probes to determine the identity of successive nucleotides in 
the target sequence. 
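
The base-calling rule just described can be sketched in a few lines. The data layout (one dictionary per column, mapping the base at the probe's interrogation position to its measured intensity) and the function names are assumptions made for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_base(intensities):
    """Call the target base as the complement of the brightest probe's base."""
    brightest = max(intensities, key=intensities.get)
    return COMPLEMENT[brightest]


def call_sequence(columns):
    """Call successive target bases from successive columns of four probes."""
    return "".join(call_base(column) for column in columns)


columns = [
    {"A": 50, "C": 40, "G": 60, "T": 900},   # brightest probe carries T
    {"A": 30, "C": 800, "G": 45, "T": 20},   # brightest probe carries C
]
print(call_sequence(columns))  # AG
```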

In many instances of comparing four corresponding
probes, one of the four probes clearly has a significantly
higher signal than the other three, and the identity of the
base in the target sequence aligned with the interrogation
position of the probes can be called with substantial
certainty. However, in some instances, two or more probes may
show similar but not identical signals. In these instances,
one can simply score the position as ambiguous.
Alternatively, one can still call a base from the probe that has
the higher signal but must recognize a significant possibility
of error. In general, if the ratio of signals of two probes
is less than 1.2, a base call has a significant possibility of
error. Ambiguous positions are most frequently due to closely
spaced multiple points of variation between target and
reference sequence (i.e., within a probe length). Ambiguities
can also arise due to low hybridization intensity because of
base composition effects.
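
One way to apply the 1.2 ratio criterion is sketched below. The function name, the return format, and the handling of zero or negative signals are assumptions; the text specifies only the ratio threshold itself.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_with_ambiguity(intensities, min_ratio=1.2):
    """Return (called_base, ambiguous) for one column of four probes.

    The call is flagged ambiguous when the ratio of the highest to the
    second-highest signal is below min_ratio.
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    (top_base, top), (_, runner_up) = ranked[0], ranked[1]
    if top <= 0:
        ambiguous = True       # assumption: no usable signal at all
    elif runner_up <= 0:
        ambiguous = False      # assumption: only one probe gave any signal
    else:
        ambiguous = top / runner_up < min_ratio
    return COMPLEMENT[top_base], ambiguous


print(call_with_ambiguity({"A": 100, "C": 90, "G": 1000, "T": 80}))  # ('C', False)
print(call_with_ambiguity({"A": 500, "C": 90, "G": 550, "T": 80}))   # ('C', True)
```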

A secondary array of probes is constructed based on
the same principles as the first array, except that the first
probe set is tiled based on the newly estimated sequence
rather than the original reference sequence. In general, the
estimated sequence includes the best estimate of base present
at positions of ambiguity as noted above. If there is equal
probability of two or more bases occupying a particular
position in the estimated sequence, one can arbitrarily decide
to include one of the bases, provide alternate tilings
corresponding to the different possible bases, or include
multiple pooled bases at the position. The secondary array
typically has second, third and fourth probe sets designed
according to the same principles as in the primary array.

The secondary array is hybridized to the same target 
nucleic acid as was the primary array. Bases in the target 
sequence are called using the same principles as described 
above by comparison of probe intensities to give rise to a 
reestimated target sequence. 

The process can be repeated through further 
iterations, if desired. Further iteration is desirable if the 
estimated sequence contains a substantial number of positions
that have been estimated with a low degree of confidence
(e.g., from a comparison of probe intensities differing by a
factor of less than 1.2). After sufficient iterations, the
estimated sequence from one cycle should converge with that
from the subsequent cycle. In some instances, positions of
ambiguities may remain through many cycles. These positions
may be due to effects such as heterozygosity, and should be
checked by other means (e.g., conventional dideoxy sequencing
or de novo sequencing by hybridization to a complete array of
probes of a given length).
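
The iterative cycle can be illustrated with a toy simulation. Everything about the hybridization model here is an assumption made for illustration: each mismatch between a probe and the true target is taken to cut the signal tenfold, signals below a detection floor give no call, and probes are written in the sense of the target strand rather than as complements. The function names are hypothetical. The sketch only shows why closely spaced differences can need more than one cycle.

```python
def simulate_intensities(target, estimate, pos, half=3):
    """Toy model (assumption): tenfold signal loss per probe/target mismatch."""
    flank = [j for j in range(pos - half, pos + half + 1) if j != pos]
    flank_mismatches = sum(estimate[j] != target[j] for j in flank)
    return {b: 0.1 ** (flank_mismatches + (b != target[pos])) for b in "ACGT"}


def reestimate(target, estimate, half=3, floor=0.05):
    """One cycle: tile probes on the current estimate and call what is readable."""
    called = list(estimate)
    # Positions too close to either end are not probed and keep their estimate.
    for pos in range(half, len(target) - half):
        signal = simulate_intensities(target, estimate, pos, half)
        best = max(signal, key=signal.get)
        if signal[best] >= floor:  # below the floor, keep the current estimate
            called[pos] = best
    return "".join(called)


def iterate(target, reference, max_cycles=10):
    """Repeat design and reestimation until the estimate stops changing."""
    estimate, history = reference, [reference]
    for _ in range(max_cycles):
        new_estimate = reestimate(target, estimate)
        history.append(new_estimate)
        if new_estimate == estimate:
            break
        estimate = new_estimate
    return history


reference = "GATTACAGATTACAG"
target = "GATTATAAACTACAG"  # differs from the reference at three nearby positions
for cycle, sequence in enumerate(iterate(target, reference)):
    print(cycle, sequence)
```

In this toy run the first cycle resolves the two outer differences, the second resolves the middle one once its flanking probes match, and the third confirms that the estimate is constant between successive cycles.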

Many variations in array design and analysis are 
possible, as described for example in WO 95/11995; EP 717,113; 
WO 97/29212. Optionally, arrays tile both strands of a 
reference sequence. Both strands are tiled separately using
the same principles described above, and the hybridization
patterns of the two tilings are analyzed separately.
Typically, the hybridization patterns of the two strands
indicate the same results (i.e., location and/or nature of
variation between target sequence and reference sequence).
Occasionally, there may be an apparent inconsistency between
the hybridization patterns of the two strands due to, for
example, base-composition effects on hybridization
intensities. Combination of results from the two strands
increases the probability of correct base calling and can
decrease the number of iterations required to determine the
correct base sequence of a target.
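
The text does not specify how results from the two strands are combined. The sketch below shows one simple possibility, offered purely as an assumption: where the two tilings agree the call is accepted, and where they disagree the call with the larger top-to-second signal ratio is kept and the position is flagged. Both sets of calls are assumed to be expressed in the sense of the same strand, and the function name is hypothetical.

```python
def combine_strands(forward_calls, reverse_calls):
    """Combine per-position (base, ratio) calls from the two strand tilings.

    Returns (base, ratio, strands_agree) per position.
    """
    combined = []
    for (f_base, f_ratio), (r_base, r_ratio) in zip(forward_calls, reverse_calls):
        if f_base == r_base:
            combined.append((f_base, max(f_ratio, r_ratio), True))
        elif f_ratio >= r_ratio:
            combined.append((f_base, f_ratio, False))
        else:
            combined.append((r_base, r_ratio, False))
    return combined


forward = [("A", 5.0), ("C", 1.1), ("G", 3.0)]
reverse = [("A", 4.0), ("T", 2.5), ("G", 1.05)]
print(combine_strands(forward, reverse))
```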

In a further variation, duplicate arrays are 
synthesized to allow analysis of hybridization between target 
sequence and probes under conditions of high and low
stringency. Although high stringency is generally most
useful, there are some regions of target sequence where the
absolute hybridization intensity is low due to base
composition effects, which yield base calls with a higher
degree of confidence under conditions of low stringency.
Statistical combination of base calls from conditions of high
and low stringency can increase the overall probability of
correct base calling.

6. Synthesis and Scanning of Probe Arrays

Arrays of probes immobilized on supports can be
synthesized by various methods. A preferred method is
VLSIPS™ (see Fodor et al., US 5,143,854; EP 476,014; Fodor et
al., 1993, Nature 364, 555-556; McGall et al., USSN
08/445,332), which entails the use of light to direct the
synthesis of oligonucleotide probes in high-density,
miniaturized arrays (sometimes known as chips). Algorithms
for design of masks to reduce the number of synthesis cycles
are described by Hubbell et al., US 5,571,639 and US 5,593,839.
Arrays can also be synthesized in a combinatorial fashion by
delivering monomers to cells of a support by mechanically
constrained flowpaths. See Winkler et al., EP 624,059.
Arrays can also be synthesized by spotting monomer reagents
onto a support using an ink jet printer. See id.; Pease et
al., EP 728,520.

After hybridization of control and target samples to
an array containing one or more probe sets as described above
and optional washing to remove unbound and nonspecifically
bound probe, the hybridization intensity for the respective
samples is determined for each probe in the array. For
fluorescent labels, hybridization intensity can be determined
by, for example, a scanning confocal microscope in photon
counting mode. Appropriate scanning devices are described by
e.g., Trulson et al., US 5,578,832; Stern et al., US
5,631,734.

7. Large-Scale Resequencing

The methods described above can be used for 
comparative analysis of whole genomes or substantial portions 
thereof. To illustrate, about 300 chips at 1 Mb/chip are
required to sequence 10% of a mammalian genome (i.e., all the
genes and a substantial amount of their surrounding sequence).
If 40 chips are synthesized on a common wafer using a single
mask, then only 8 mask designs are required per iteration. If 
10 iterations are required, then only 80 mask designs and a 
total of 3000 chips are made. 
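
The arithmetic of this illustration can be checked with a short sketch. The function name `chip_budget` and its parameters are hypothetical; the inputs are the figures given above (300 Mb to cover, 1 Mb per chip, 40 chips per wafer and mask design, 10 iterations).

```python
def chip_budget(target_mb, mb_per_chip, chips_per_wafer, iterations):
    """Return chip and mask-design counts for an iterative resequencing run."""
    # Integer ceiling division: a partly used chip or wafer still counts as one.
    chips_per_iteration = -(-target_mb // mb_per_chip)
    masks_per_iteration = -(-chips_per_iteration // chips_per_wafer)
    return {
        "chips_per_iteration": chips_per_iteration,
        "masks_per_iteration": masks_per_iteration,
        "total_masks": masks_per_iteration * iterations,
        "total_chips": chips_per_iteration * iterations,
    }


print(chip_budget(target_mb=300, mb_per_chip=1, chips_per_wafer=40, iterations=10))
```

With these inputs, 300 chips on wafers of 40 round up to 8 mask designs per iteration, giving 80 mask designs and 3000 chips over 10 iterations.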

Although an entire genome can be hybridized to a 
chip in a single experiment, it is often more useful to 
hybridize pools of cloned sequence representing ~1 Mb at a
time. This can be done in the following way. A minimal
overlapping set of physical clones is first obtained. For
example, random bacterial artificial chromosome clones are
generated, and ordered by hybridization or conventional
methods. If necessary, regions mapping to related positions
in the genome are determined. For example, pools of clones are
hybridized to an array of mapped markers. Pools of clones are
then generated for hybridization (e.g., 300 pools if the
resequencing capacity is 1 Mb/chip and 300 chip designs are
used to analyze 1/10th of a mammalian genome).

8. Applications 

Some of the benefits of resequencing related genomes
are:

1) Correction of sequencing errors. These are often 
corrected by comparative analysis. For example, if an open 
reading frame in one genome is frameshifted in a second 
closely related genome, a sequencing error is usually the 
cause of the difference. Any sequence differences detected
can be verified in the reference genome by simply checking the
primary sequencing trace data, or by further analysis.

2) Identification of promoter sequences and genes.
Functionally important elements tend to be conserved. 
Sometimes, functional elements that are difficult to identify 
by direct sequence analysis (such as small exons or regulatory 
sequences) are revealed by identifying relatively short 
segments that are tightly conserved between genomes. 

3) Analysis of sequence differences between
different species allows correlation between form and
function. For example, the sequences of chimpanzee and human
differ by ~1% overall. Further, the present methods allow
comparison of a range of primate sequences, to see which 
sequences have evolved the most rapidly and which are highly 
conserved. 

It will be apparent from the above that the 
invention includes a general concept which can be expressed 
concisely as follows. The invention entails the use of 
iterative cycles of designing an array of probes to be 
complementary to an estimated sequence of a target nucleic 
acid, and using the hybridization pattern of the array to the 
target nucleic acid sequence to determine a more accurate 
reestimated target sequence. 

All publications and patent applications cited above 
are incorporated by reference in their entirety for all 
purposes to the same extent as if each individual publication 
or patent application were specifically and individually 
indicated to be so incorporated by reference. Although the 
present invention has been described in some detail by way of
illustration and example for purposes of clarity and
understanding, it will be apparent that certain changes and
modifications may be practiced within the scope of the
appended claims.